Abstract

The estimation of the correlation between time series is often hampered by the
asynchronicity of the signals. Cumulating data within a time window suppresses
this source of noise but weakens the statistics. We present a method to estimate
correlations without applying long time windows. We decompose the correlations of
data cumulated over a long window using the decay of lagged correlations as
calculated from short-window data. This increases the accuracy of the estimated
correlation significantly and reduces the computational effort required in both
real and computer experiments.
1 Introduction



Correlations between time series are fundamental for understanding and
interpreting stochastic processes. Very often the time between the signals of a
series is distributed unevenly, causing asynchronicity of the compared series.

Correlations between asynchronous signals can be of great importance in several
areas. An important example is neutron activation analysis [1]. These
experiments are used for non-destructive testing of materials in order to
determine the concentration of their constituents. In the analysis the specimen
is bombarded by neutrons coming from a source, causing the elements to form
radioactive isotopes, and from the spectra of the emissions of the radioactive
sample, the concentration of the elements can be determined. Since the radiation
appears together for all kinds of atoms, different elements are going to radiate
in a correlated but asynchronous way [1,2].

Another example can be taken from materials science or seismology. Mechanical
failures can be tested by their wave radiation and it is crucial to know if the
signals measured by different sensors are correlated [3]. Obviously, the
analysis involves the handling of asynchronous signals.

Correlations of returns of different companies are fundamental input data for
portfolio optimisation. As the transactions are asynchronous, the correlations
measured on short time scales are significantly reduced, which is called the
Epps effect.

From a theoretical point of view, continuous time random walks [2] can be
mentioned, which can be used to describe a broad range of processes from
transport in disordered solids [6] to finance [7]. Correlated continuous time
random walks produce asynchronous time series.

When computing the Pearson correlation coefficient of two stationary signals
we often have to face an important problem: the correlation measure is designed
to determine the degree of co-movement of synchronous observations, while the
signals are asynchronous. The usual way to handle this problem is to cumulate
data over a time window $\Delta t$ and look for the correlations between these
binned data. In order to approach the asymptotic, proper value of the
correlation coefficient, $\Delta t$ should be much larger than the scale of
asynchronicity. However, this reduces the statistics and consequently makes the
estimates inaccurate. On the other hand, for short $\Delta t$ the noise due to
asynchronicity may reduce the measured correlations significantly.

It has been suggested to use measures of correlation other than the Pearson
coefficient to overcome the problem of asynchronicity. Ref. [8] presents a
method of measuring covariance based on Fourier series analysis of data. This
method has been applied by Refs. [9,10] in the study of financial correlations.
While the Fourier method is indeed somewhat less sensitive to asynchronicity,
the problem cannot be eliminated by its use. Refs. [11] propose a new estimator
of the covariance of two diffusion processes that are observed only at discrete
times in a non-synchronous manner. Their estimator uses all available data and
does not require synchronization of observations; however, in the presence of
noise it becomes inconsistent and its variance diverges. A good comparison of
several covariance estimators can be found in Refs. [12,13].

In this paper we describe an estimator of the Pearson correlation coefficient
for correlated, asynchronous pairs of data. It is based on an appropriate
decomposition of the expression for the correlation coefficient for large
$\Delta t$, using the value of the coefficient for a small time window and the
decay of lagged correlations. The latter can be calculated from high-resolution
data with good statistics. We have already applied the decomposition scheme to
explain [14] the Epps effect [1]. Here we show how it can be used to accurately
estimate the asymptotic correlation coefficients.

The paper is organized as follows: In Section 2 we review the decomposition
of correlations. Section 3 contains the description of our method together with
some demonstrative results. In Section 4 we apply the method to real data.
We end the paper with a short summary in Section 5.



2 Decomposition of correlations 



Let us consider a system where discrete stationary signals arrive from two
different sources ($A$ and $B$) at different time instants, resulting in two
correlated time series. We count the hits arriving, and denote their cumulative
number, measured from a reference time $t = 0$ up to time $t$, by $c^{A}(t)$
and $c^{B}(t)$, respectively.

We are interested in the correlation between the number of hits arriving at
our sensor in a certain time window, thus in the change in $c^{A}$ and $c^{B}$.
This will be denoted by

$$ r^{A}_{\Delta t}(t) = c^{A}(t) - c^{A}(t - \Delta t). \tag{1} $$

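The general Pearson correlation measure with time lag $\tau$ is defined by

$$ C^{A/B}_{\Delta t}(\tau) = \frac{\left\langle \Delta r^{A}_{\Delta t}(t)\, \Delta r^{B}_{\Delta t}(t + \tau) \right\rangle}{\sigma^{A} \sigma^{B}}, \tag{2} $$

where $\Delta r_{\Delta t}(t)$ is the deviation of $r_{\Delta t}(t)$ from its mean,

$$ \left\langle \Delta r^{A}_{\Delta t}(t)\, \Delta r^{B}_{\Delta t}(t + \tau) \right\rangle = \frac{1}{T'} \sum_{i=1}^{T'} \Delta r^{A}_{\Delta t}(i \Delta t)\, \Delta r^{B}_{\Delta t}(i \Delta t + \tau), \tag{3} $$

with $T' = [(T - \tau)/\Delta t]$, and $\sigma^{A}$ ($\sigma^{B}$) is the
standard deviation of $r^{A}_{\Delta t}$ ($r^{B}_{\Delta t}$).

We use $\langle \ldots \rangle$ for denoting time average. The equal-time
correlation coefficient is naturally $\rho^{A/B}_{\Delta t} = C^{A/B}_{\Delta t}(\tau = 0)$.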

As we can see already from the definition, the length of the time window, i.e.
the value of $\Delta t$, plays a major role in measuring the correlation.
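
The definitions in Equations (1)-(3) can be illustrated with a minimal Python sketch. It assumes that the cumulative counts are stored as one value per unit time step, that both binned series have the same length, and that the lag is a whole number of bins; the function names `bin_increments` and `lagged_correlation` are hypothetical and do not come from the paper.

```python
from math import sqrt


def bin_increments(cumulative, window):
    """Increments r(t) = c(t) - c(t - window) over non-overlapping windows, as in Eq. (1)."""
    return [cumulative[i] - cumulative[i - window]
            for i in range(window, len(cumulative), window)]


def lagged_correlation(r_a, r_b, lag=0):
    """Pearson correlation of r_a(t) with r_b(t + lag), lag counted in bins (Eqs. 2-3)."""
    size = len(r_a)  # assumption: len(r_b) == len(r_a)
    mean_a = sum(r_a) / size
    mean_b = sum(r_b) / size
    sigma_a = sqrt(sum((x - mean_a) ** 2 for x in r_a) / size)
    sigma_b = sqrt(sum((y - mean_b) ** 2 for y in r_b) / size)
    # Keep only the pairs (t, t + lag) for which both observations exist.
    if lag >= 0:
        pairs = list(zip(r_a[:size - lag], r_b[lag:]))
    else:
        pairs = list(zip(r_a[-lag:], r_b[:size + lag]))
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in pairs) / len(pairs)
    return covariance / (sigma_a * sigma_b)
```

Calling `lagged_correlation(r_a, r_b)` with the default `lag=0` gives the equal-time coefficient; repeating the call for several window lengths shows how strongly the result depends on the choice of window.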
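Assuming $\Delta t = n \Delta t_0$ with $n$ being a positive integer, we can
deduce the following relationship between correlations on the two different
time scales:

$$ \rho^{A/B}_{\Delta t} = \frac{\sum_{x=-n+1}^{n-1} (n - |x|) \left\langle r^{A}_{\Delta t_0}(t)\, r^{B}_{\Delta t_0}(t + x \Delta t_0) \right\rangle - n^{2} \left\langle r^{A}_{\Delta t_0}(t) \right\rangle \left\langle r^{B}_{\Delta t_0}(t) \right\rangle}{\sqrt{\left( \sum_{x=-n+1}^{n-1} (n - |x|) \left\langle r^{A}_{\Delta t_0}(t)\, r^{A}_{\Delta t_0}(t + x \Delta t_0) \right\rangle - n^{2} \left\langle r^{A}_{\Delta t_0}(t) \right\rangle^{2} \right) \left( \sum_{x=-n+1}^{n-1} (n - |x|) \left\langle r^{B}_{\Delta t_0}(t)\, r^{B}_{\Delta t_0}(t + x \Delta t_0) \right\rangle - n^{2} \left\langle r^{B}_{\Delta t_0}(t) \right\rangle^{2} \right)}}. \tag{4} $$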



Details on the deduction of Equation (4) can be found in the Appendix. It is
plausible to set $\Delta t_0$ as the shortest meaningful time scale in the
system, which has to be chosen with the actual problem in mind.

This decomposition can be used to accurately estimate correlations by using
high resolution data, i.e., data with good statistics.
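
As a rough illustration, the decomposition can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: it combines the three lag-weighted sums and the mean correction described in the Appendix, assumes that both series are already binned at the short time scale and have equal length, and uses the hypothetical names `lagged_mean_product` and `decomposed_correlation`.

```python
from math import sqrt


def lagged_mean_product(a, b, lag):
    """Time average of a(t) * b(t + lag) over all t where both samples exist."""
    size = len(a)
    if lag >= 0:
        pairs = zip(a[:size - lag], b[lag:])
    else:
        pairs = zip(a[-lag:], b[:size + lag])
    products = [x * y for x, y in pairs]
    return sum(products) / len(products)


def decomposed_correlation(r_a, r_b, n):
    """Estimate the correlation at window n * dt0 from increments binned at dt0."""
    mean_a = sum(r_a) / len(r_a)
    mean_b = sum(r_b) / len(r_b)
    cross = auto_a = auto_b = 0.0
    for x in range(-n + 1, n):
        weight = n - abs(x)  # number of short-window pairs separated by lag x
        cross += weight * lagged_mean_product(r_a, r_b, x)
        auto_a += weight * lagged_mean_product(r_a, r_a, x)
        auto_b += weight * lagged_mean_product(r_b, r_b, x)
    covariance = cross - n * n * mean_a * mean_b
    variance_a = auto_a - n * n * mean_a ** 2
    variance_b = auto_b - n * n * mean_b ** 2
    return covariance / sqrt(variance_a * variance_b)
```

For `n = 1` the function reduces to the ordinary Pearson coefficient of the short-window data, which is a convenient sanity check before moving to larger windows.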



3 Method 

Expression (4) makes it possible to calculate the correlation coefficient for
any sampling time scale, $\Delta t$, by knowing the coefficient on a shorter
sampling time scale, $\Delta t_0$, and the decay of lagged correlations on the
same shorter sampling time scale (given that $\Delta t$ is a multiple of
$\Delta t_0$).

The method we would like to propose relies on this decomposition of
correlations.

We suggest the following procedure. Data should be binned with a small
$\Delta t_0$ such that good statistics are achieved, irrespective of the fact
that noise due to asynchronicity may be considerable. Then the correlation
functions and the decay of lagged correlations should be calculated using these
data (of course the calculated correlation can be expected to be too small).
Plugging these quantities into Equation (4) with a large enough $\Delta t$ we
obtain a good estimate for the proper correlation. Using different values of
$\Delta t$, an extrapolation to the proper, asymptotic correlation coefficient
is possible.

We demonstrate the method in more detail in this section. We use correlated
random walks and show that when the correlations are measured directly in large
$\Delta t$ time windows, very long time series are needed in order to have a
good estimate for the correlations. When using the decomposition method, we use
high resolution data and thus can achieve an estimate for the correlations with
the same accuracy from a dataset of much shorter time span.

Fig. 1. (Color online) Illustration of the asynchronous sampling, as introduced in
[14]. The original random walk is shown with lines (black), the two sampled series
with triangles and line (red) and dots and line (blue).



3.1 Demonstration

We construct correlated asynchronous time series in the following way. As a
first step we generate a core random walk with unit steps up or down in each
second:

$$ W(t) = W(t-1) + \varepsilon(t), \tag{5} $$

where $\varepsilon(t)$ is $\pm 1$ with equal probability. Second, we sample the
random walk, $W(t)$, twice independently with sampling time intervals drawn
from some distribution. This way we simulate asynchronicity by
non-simultaneously sampling our generated data. A snapshot of the random walks
can be seen in Figure 1.
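
The construction can be sketched in Python using only the standard library. This is an illustrative sketch under stated assumptions: sampling instants are effectively rounded up to whole seconds, each sampled series holds its last observed value between sampling instants, and the names `core_random_walk` and `sample_asynchronously` are hypothetical.

```python
import random


def core_random_walk(length, rng):
    """W(t) = W(t-1) + e(t), with e(t) equal to +1 or -1 with equal probability (Eq. 5)."""
    walk = [0]
    for _ in range(length):
        walk.append(walk[-1] + rng.choice((-1, 1)))
    return walk


def sample_asynchronously(walk, draw_interval):
    """Observe the walk only at random instants and hold the last observed value in between."""
    observed = []
    current = walk[0]
    next_time = draw_interval()
    for t, value in enumerate(walk):
        if t >= next_time:
            current = value  # a sampling instant has passed: refresh the observation
            while next_time <= t:
                next_time += draw_interval()
        observed.append(current)
    return observed


rng = random.Random(0)
walk = core_random_walk(1000, rng)
# Two independent samplings of the same walk give correlated, asynchronous series.
series_a = sample_asynchronously(walk, lambda: rng.expovariate(1 / 60))
series_b = sample_asynchronously(walk, lambda: rng.expovariate(1 / 60))
```

Passing a different interval generator, for example `lambda: rng.weibullvariate(20, 0.7)` (scale first, shape second in Python's `random` module), changes the sampling distribution without touching the rest of the code.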

Below we present results first for exponentially distributed sampling intervals
between steps and second for the sampling intervals between steps being drawn
from a Weibull distribution. We study the correlation between the changes in
the position of the two random walkers.

Fig. 2. (Color online) Comparison of the directly measured correlation coefficients
and the coefficients determined through the decomposition method in case of
exponentially distributed sampling intervals. In blue we show the average and the
standard deviation of the direct measurements computed over the ensemble of 50
time series pairs. In red we show the average and the standard deviation obtained
by the decomposition method, computed over the ensemble of 50 time series pairs.

3.1.1 Exponential sampling intervals 

We generate 50 pairs of time series, as described above, with sampling intervals
between consecutive changes from the distribution:

$$ P(y) = \lambda e^{-\lambda y}, \tag{6} $$

with parameter $\lambda = 1/60$. Each time series has a length of 25000 time
steps. Naturally, since the time series are finite, the correlation coefficients
that we measure will have errors. In Figure 2 we show the results for the
correlation coefficients on different time scales, where the shortest time scale
used was $\Delta t_0 = 10$.

In blue we show the average of the direct measurements, taken over 50 points.
The errorbars show the standard deviation of the points. In red we show the
average result obtained by the decomposition method using Equation (4), taken
over 50 points, with their standard deviation as errorbars. As we can see, there
is a significant difference between the errors of the two measurements, while
their means are very near to each other. In general, the error of measurements
goes as $\sigma/\sqrt{N}$, where $\sigma$ is the standard deviation of the
distribution of results and $N$ is the number of data points. The ratio of the
standard deviations at $\Delta t = 1000$ in Figure 2 is close to 3. This means
that in order to obtain the same precision from direct measurements as from the
decomposition method, we need roughly one order of magnitude more data points.
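
The step from a ratio of 3 to "one order of magnitude" can be written out explicitly. Assuming that the error scales as $\sigma/\sqrt{N}$ for both estimators, and using the hypothetical labels "direct" and "dec" for the two methods, equal precision requires

$$ \frac{\sigma_{\mathrm{direct}}}{\sqrt{N_{\mathrm{direct}}}} = \frac{\sigma_{\mathrm{dec}}}{\sqrt{N_{\mathrm{dec}}}} \quad\Longrightarrow\quad \frac{N_{\mathrm{direct}}}{N_{\mathrm{dec}}} = \left( \frac{\sigma_{\mathrm{direct}}}{\sigma_{\mathrm{dec}}} \right)^{2} \approx 3^{2} = 9. $$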

Generally, one is interested in the asymptotic value of the correlations, i.e.
in the limit $\rho_{\Delta t \to \infty}$. As we can see, even for the
numerically generated data, determining the correlation on the scale of
$\Delta t = 1000$ still gives a correlation different from the underlying
asymptotic correlation, which is 1. A possible way to determine the asymptotic
correlation value is to use some extrapolation method.

We demonstrate that by applying a simple extrapolation method to the generated
data we can determine the exact underlying correlation with very good accuracy.
Since we are interested in the $\Delta t \to \infty$ value, we use the plot of
$\rho_{\Delta t}$ as a function of $1/\Delta t$ for the extrapolation. Figure 3
shows the correlation points and the curve determined by the piecewise Cubic
Hermite Interpolation method. The extrapolated curve intercepts the y-axis at
the value of 1.002, which is very close to the actual asymptotic correlation
value, with an error around 0.2%.
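
One possible way to carry out such an extrapolation is sketched below with SciPy's `PchipInterpolator`. The text does not state which software was used, so this is an assumption; in particular, SciPy extrapolates by continuing the outermost polynomial piece, which may differ from the authors' procedure. The function name `asymptotic_correlation` and the input values in the usage note are hypothetical.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator


def asymptotic_correlation(windows, correlations):
    """Extrapolate rho(dt) to 1/dt -> 0 with a piecewise cubic Hermite interpolant."""
    inverse = 1.0 / np.asarray(windows, dtype=float)
    values = np.asarray(correlations, dtype=float)
    order = np.argsort(inverse)  # PchipInterpolator needs strictly increasing x
    curve = PchipInterpolator(inverse[order], values[order], extrapolate=True)
    return float(curve(0.0))  # intercept with the y-axis
```

As a usage illustration, synthetic points lying on a straight line in $1/\Delta t$, such as `windows = [10, 20, 50, 100, 1000]` with `correlations = [1 - 2 / w for w in windows]`, should return an intercept of 1 up to rounding.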

We studied the effect of the errors of the two methods on the accuracy of the 
extrapolated value of the correlation coefficient. Applying a piecewise Cubic 
Hermite Interpolation to the endpoints of the error bars, we find that the 
extrapolated value of the asymptotic correlations using direct measurements 
falls between 0.979 and 1.035, while the value for the decomposition method 
falls between 0.998 and 1.005, indicating a factor of 8 improvement in the 
precision. 

3.1.2 Weibull sampling intervals

In order to demonstrate the power of the method for a non-Poisson process we
generate 50 pairs of time series, as described above, with sampling intervals
between consecutive changes generated from a Weibull distribution:

$$ P(y) = \frac{b}{a} \left( \frac{y}{a} \right)^{b-1} e^{-(y/a)^{b}}, \tag{7} $$

with parameters $a = 20$ and $b = 0.7$. Again each time series has a length
of 25000 time steps and the directly measured correlation values have large
variance. In Figure 4 we show the results for the correlation coefficients on
different time scales, again with $\Delta t_0 = 10$.

Fig. 3. (Color online) The correlation as a function of $1/\Delta t$ for the
exponentially distributed sampling intervals. The circles show the correlations
determined using the decomposition method, and the (red) curve is the piecewise
Cubic Hermite Interpolation of the correlation values. The extrapolation gives an
asymptotic correlation value of 1.002, very close to the actual underlying value,
which is 1.

The figure shows that we get a better estimate of the means from the
decomposition formula than from direct measurements. In this case the ratio of
the standard deviations at $\Delta t = 1000$ is close to 2. The same improvement
is obtained for the precision of the extrapolated asymptotic correlation
coefficients. The extrapolation of the asymptotic coefficient can be seen in
Figure 5.



The two above examples on generated time series show that using our method 
and estimating the correlation coefficient between asynchronous signals from 
the high frequency data leads to much smaller variation of the results than in 
case of direct measurement of correlations on lower frequency data. 



3.2 Decay functions 

As we can see from Equation (4), a subtle point of the correlation estimation
is the measurement of the decay of lagged correlations on the short
($\Delta t_0$) time scale, which we will call decay functions. There are three
decay functions we have to measure:
$\langle r^{A}_{\Delta t_0}(t)\, r^{A}_{\Delta t_0}(t + x \Delta t_0) \rangle$,
$\langle r^{B}_{\Delta t_0}(t)\, r^{B}_{\Delta t_0}(t + x \Delta t_0) \rangle$ and
$\langle r^{A}_{\Delta t_0}(t)\, r^{B}_{\Delta t_0}(t + x \Delta t_0) \rangle$.
To have a good estimate of the asymptotic correlation value, one has to have
precise measurements of these decay functions.

Fig. 4. (Color online) Comparison of the directly measured correlation coefficients
and the coefficients determined through the decomposition method in case of
Weibull distributed sampling intervals. In blue we show the average and the
standard deviation of the direct measurements computed over the ensemble of 50 time
series pairs. In red we show the average and the standard deviation obtained by the
decomposition method, computed over the ensemble of 50 time series pairs.

In the above examples we had random walks as underlying processes, thus
by definition the autocorrelations are delta functions. However, because of the
asynchronous sampling, the cross-correlation is smeared out and instead of a
delta function we have a finite decay of the cross-correlation. These cases are
simple from the point of view of the decay functions. To demonstrate that the
decomposition can still give a good estimate of the asymptotic correlation for
more complicated, slowly vanishing decay functions, we consider the following
time series. We generate a persistent random walk [15,16], i.e. a walk where
the probability, $\alpha$, of jumping in the same direction as in the previous
step is higher than 0.5. Then, as we have done before, we sample the persistent
random walk twice independently with sampling intervals drawn from a Weibull
distribution (again with parameters $a = 20$ and $b = 0.7$). This construction
generates slowly vanishing decay functions (that would be exponentially
decaying without the asynchronous sampling). Again we generate 50 pairs of
time series, each being 25000 steps long; the persistency is $\alpha = 0.999$.
Figures 6 and 7 show two of the decay functions (the decays of the
autocorrelations are identical, so we only show one of them). Here we set
$\Delta t_0 = 50$.

Fig. 5. (Color online) The correlation as a function of $1/\Delta t$ for the
Weibull distributed sampling intervals. The circles show the correlations
determined using the decomposition method, and the (red) curve is the piecewise
Cubic Hermite Interpolation of the correlation values. The extrapolation gives an
asymptotic correlation value of 0.992, very close to the actual underlying value,
which is 1.
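
A persistent random walk of this kind can be generated with a few lines of Python. This is a minimal sketch, assuming unit steps as for the core random walk above; the function name `persistent_random_walk` and its argument names are hypothetical.

```python
import random


def persistent_random_walk(length, persistence, rng):
    """Unit-step walk that repeats its previous step direction with probability `persistence`."""
    step = rng.choice((-1, 1))  # the first step has no predecessor: pick it at random
    walk = [0, step]
    for _ in range(length - 1):
        if rng.random() >= persistence:
            step = -step  # reverse direction with probability 1 - persistence
        walk.append(walk[-1] + step)
    return walk


rng = random.Random(0)
walk = persistent_random_walk(25000, 0.999, rng)
```

With a persistence of 0.999 the walk reverses direction on average once every 1000 steps, which is what produces the slow decay of the autocorrelations.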

In Figure 8 we show the results for the correlation coefficients on different
time scales. The decomposition method gives good results in this case too. The
ratio of the standard deviations at $\Delta t = 1000$ is close to 3.5, signaling
that in order to obtain the same precision, we need roughly one order of
magnitude more data points in case of direct measurements than for the
decomposition method. Figure 9 shows the extrapolation to the asymptotic value
of the correlation, using the piecewise Cubic Hermite Interpolation method. The
extrapolated curve intercepts the y-axis at the value of 1.002. Applying the
extrapolation to the endpoints of the error bars, comparing the direct
measurements and the decomposition results, we find a factor of 20 improvement
in the precision.



4 Demonstration on real data 

As discussed in the introduction, an example of correlated asynchronous signals
is the case of stock market data. Price changes for different assets on the
market appear in an asynchronous manner; however, it is well known that there
are important correlations between the price changes. In order to demonstrate
our method on real world data, in this Section we show how it can be used to
estimate financial correlations.

Fig. 6. (Color online) The decay function
$\langle r^{A}_{\Delta t_0}(t)\, r^{A}_{\Delta t_0}(t + x \Delta t_0) \rangle$ in
case of the persistent random walk, sampled with Weibull sampling intervals. The
autocorrelations decrease slowly.

Fig. 7. (Color online) The decay function
$\langle r^{A}_{\Delta t_0}(t)\, r^{B}_{\Delta t_0}(t + x \Delta t_0) \rangle$ in
case of the persistent random walk, sampled with Weibull sampling intervals. The
cross-correlations decrease slowly.

Fig. 8. (Color online) Comparison of the directly measured correlation coefficients
and the coefficients determined through the decomposition method in case of the
persistent random walk, with Weibull sampling. In blue we show the average and
the standard deviation of the direct measurements computed over the ensemble of
50 time series pairs. In red we show the average and the standard deviation obtained
by the decomposition method, computed over the ensemble of 50 time series pairs.

We took data for Coca-Cola and Pepsi, a pair of stocks with strong correlations,
from the high frequency Trade and Quote (TAQ) Database of the New York
Stock Exchange (NYSE) for the year 2000. We computed the logarithmic
returns of stock prices:

$$ r^{A}_{\Delta t}(t) = \ln \frac{p^{A}(t)}{p^{A}(t - \Delta t)}, $$

where $p^{A}(t)$ stands for the price of stock $A$ at time $t$. The prices were
determined using the previous tick estimator on the high frequency data, i.e.
prices are defined to be constant between two consecutive trades. What we study
is the cross-correlation coefficient between the data of different stocks.
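
The previous tick estimator and the logarithmic returns can be sketched as follows. This is an illustrative sketch only: it assumes that trade times are sorted and that the returns are evaluated on a regular grid, and the names `previous_tick_prices` and `log_returns` as well as the numbers in the usage note are hypothetical.

```python
from bisect import bisect_right
from math import log


def previous_tick_prices(trade_times, trade_prices, grid_times):
    """Price at each grid time is the price of the last trade at or before that time."""
    prices = []
    for t in grid_times:
        index = bisect_right(trade_times, t) - 1
        if index < 0:
            raise ValueError("grid time precedes the first trade")
        prices.append(trade_prices[index])
    return prices


def log_returns(prices):
    """Logarithmic returns between consecutive grid points."""
    return [log(later / earlier) for earlier, later in zip(prices, prices[1:])]
```

For example, trades at times `[0, 3, 7]` with prices `[100.0, 110.0, 121.0]`, evaluated on the grid `[0, 2, 4, 6, 8]`, give grid prices that stay constant between trades, so several of the resulting returns are exactly zero. This is the mechanism by which asynchronous trading suppresses correlations measured on short time scales.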

We divided the data for the year 2000 into 50 disjoint periods of 5 days (weekly
periods) and measured the correlations on these time intervals. This way we
handle the separate weeks independently, similarly to the case of the generated
data: we have 50 time series pairs and can study both the directly calculated
correlation coefficients and the coefficients obtained through the
decomposition method.

Fig. 9. (Color online) The correlation as a function of $1/\Delta t$ for the
persistent random walk, with Weibull sampling. The circles show the correlations
determined using the decomposition method, and the (red) curve is determined by
using the piecewise Cubic Hermite Interpolation method. The extrapolation gives an
asymptotic correlation value of 1.002.

Figure 10 shows the results for the correlation coefficients on different time
scales. We can see that the standard deviation of the coefficients obtained
through the decomposition is much lower than that of the direct measurements:
the ratio of the standard deviations at $\Delta t = 6000$ seconds is 4.6,
signaling that in order to obtain the same precision, we need roughly 20 times
more data points in case of direct measurements than for the decomposition
method.



5 Summary 

In this paper we discussed the problem of estimating the correlation coefficient
between two asynchronous signals. While the direct use of high resolution data
results in an underestimation of the correlations, coarser binning of the
data leads to larger errors due to loss of data.

We proposed a method which makes it possible to estimate the asymptotic value of
correlations from the high frequency data, without the need to use longer
time scales and thus without worsening the statistics. The correlations from the
high frequency data can be determined very accurately, based on the good
statistics. We demonstrated our method on generated data sets, showing that
the error of correlations determined by our method is much smaller than the
errors of correlations measured directly, using long time windows. Extrapolating
to the asymptotic correlation from the determined correlation values leads
to a very accurate estimate of the underlying correlation. A very important
question in the estimation of the asymptotic correlation value is the
determination of the shortest meaningful time scale, $\Delta t_0$, on which we
measure the decay functions. The asynchronicity of the signals slows down the
decrease of the decay functions. In the paper we showed that the decomposition
gives a good estimate of the asymptotic correlation value also in case of
non-trivial decay functions.

Fig. 10. (Color online) Comparison of the directly measured correlation coefficients
and the coefficients determined through the decomposition method in case of real
world stock market data. The correlations are computed for the stock pair KO/PEP
for the year 2000. In blue we show the average and the standard deviation of the
direct measurements computed over the ensemble of 50 weekly time series pairs. In
red we show the average and the standard deviation obtained by the decomposition
method, computed over the ensemble of 50 weekly time series pairs.

We demonstrated how the method works for real data. When studying weekly
cross-correlations of stock returns we showed that the precision of the
coefficients obtained through our method is much higher than that of the direct
measurements.
A The decomposition of correlations

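We can write the correlation coefficient from Equation (2) in the following form:

$$ \rho^{A/B}_{\Delta t} = \frac{\left\langle r^{A}_{\Delta t}(t)\, r^{B}_{\Delta t}(t) \right\rangle - \left\langle r^{A}_{\Delta t}(t) \right\rangle \left\langle r^{B}_{\Delta t}(t) \right\rangle}{\sigma^{A} \sigma^{B}}. \tag{A.1} $$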



We assume two time scales, $\Delta t$ and $\Delta t_0$, where
$\Delta t = n \Delta t_0$, with $n$ being a positive integer. The change in the
measured quantity in the time window $\Delta t$ is simply the sum of the changes
in shorter, non-overlapping time windows $\Delta t_0$:

$$ r_{\Delta t}(t) = \sum_{s=1}^{n} r_{\Delta t_0}(t - \Delta t + s \Delta t_0). \tag{A.2} $$

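Using this relationship the time average can be written in the following form:

$$ \left\langle r^{A}_{\Delta t}(t)\, r^{B}_{\Delta t}(t) \right\rangle = \sum_{s=1}^{n} \sum_{q=1}^{n} \left\langle r^{A}_{\Delta t_0}(t - \Delta t + s \Delta t_0)\, r^{B}_{\Delta t_0}(t - \Delta t + q \Delta t_0) \right\rangle. \tag{A.3} $$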

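The sum in Eq. (A.3) can be written in the following way for stationary signals:

$$ \left\langle r^{A}_{\Delta t}(t)\, r^{B}_{\Delta t}(t) \right\rangle = \sum_{x=-n+1}^{n-1} (n - |x|) \left\langle r^{A}_{\Delta t_0}(t)\, r^{B}_{\Delta t_0}(t + x \Delta t_0) \right\rangle, \tag{A.4} $$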

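and similarly

$$ \left\langle r^{A}_{\Delta t}(t)^{2} \right\rangle = \sum_{x=-n+1}^{n-1} (n - |x|) \left\langle r^{A}_{\Delta t_0}(t)\, r^{A}_{\Delta t_0}(t + x \Delta t_0) \right\rangle, $$

$$ \left\langle r^{B}_{\Delta t}(t)^{2} \right\rangle = \sum_{x=-n+1}^{n-1} (n - |x|) \left\langle r^{B}_{\Delta t_0}(t)\, r^{B}_{\Delta t_0}(t + x \Delta t_0) \right\rangle. \tag{A.5} $$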



On the other hand, for stationary signals, Equation (A.2) leads to

$$ \left\langle r_{\Delta t}(t) \right\rangle = n \left\langle r_{\Delta t_0}(t) \right\rangle. \tag{A.6} $$



Combining the above equations we can deduce a relationship between the
correlation coefficients measured on two different sampling time scales and we
get Equation (4).
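
The weights $(n - |x|)$ that appear when the double sum in Eq. (A.3) is collapsed into a single sum over lags come from counting how many index pairs $(s, q)$ share the same difference $x = q - s$. A short numerical check of this counting argument is sketched below; the function names `double_sum` and `lag_sum` are hypothetical, and `g` stands for any function that depends only on the lag, as is the case for stationary signals.

```python
def double_sum(g, n):
    """Sum of g(q - s) over all index pairs s, q = 1..n, as in Eq. (A.3)."""
    return sum(g(q - s) for s in range(1, n + 1) for q in range(1, n + 1))


def lag_sum(g, n):
    """The same sum grouped by lag x = q - s, with multiplicity n - |x|, as in Eq. (A.4)."""
    return sum((n - abs(x)) * g(x) for x in range(-n + 1, n))
```

Both functions return the same value for any `g` and any positive `n`; for a constant `g` equal to 1 they both return $n^2$, the total number of index pairs.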